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Abstract 

Motivation: 

Recent advances in experimental techniques have generated large amounts of protein interaction data, producing net- 
works containing large numbers of cellular proteins. Mathematically sound and robust foundations are needed for 
extensive, context-specific exploration of networks, integrating knowledge from different specializations and facili- 
tating biological discovery. 

Results: 

Extending our earlier work, we present a theoretical construct, based on random walks, for modelling information 
channels between selected points in interaction networks. The software implementation, called ITM Probe, can be 
used as a network exploration and hypothesis forming tool. Through examples involving the yeast pheromone re- 
sponse pathway, we illustrate the versatility and stability of ITM Probe. 

Availability: 

|www.ncbi.nlm.nih.gov/CBBresearch/qmbp/itm_probe| 
Contact: 

yyu@ncbi.nlm.nih.gov 



*to whom correspondence should be addressed 



1 Introduction 



Protein interaction networks are presently the subject of intensive and wide-ranging research (Ba der et all 12008). 



Recently, a number of authors have applied the concepts of information flow and random walk s to extract biologicall y 
relevant information from protein interaction networks (INabieva et al. , 120051 iTu et all 120061: ISuthram et all 12 008). 
We h ave also developed a mathematical framework for information flow in interaction networks ( Stoimirovic and Yui 



2007). In this paper, we present a major extension of our previous framework, which permits natural modelling of 



channeled flow through networks. 

Our framework models information flow in interaction networks through discrete random walks. Each random 
walk simulates a possible information path. A walker starts at a node in the network and at each discrete time step 
moves randomly to a neighboring node. The probability of moving to a particular neighboring node is proportional to 
the weight of the link from the present node to it. Unlike the classical random walks, our model allows the walker a 
certain probability to dissipate, that is, to leave the network at each step. Each walk terminates either by dissipation or 
by reaching a destination node. 

The software implementation, founded upon our theoretical ideas, can in principle integrate existing partial knowl- 
edge to form a broad picture of possible communication paths in cellular processes. Hence, it can be used as a 
hypothesis forming tool and may help in delineating possible directions for future experiments. The salient feature of 
our method is its ability to provide context-specific analysis. 

This paper is stuctured as follows. Section|2]presents an overview of our previous work and introduces the notion 
of a channel tensor. Section [3] describes the software implementation of the theory, called ITM Probe. Section [4] uses 
the yeast pheromone response pathway to exhibit several aspects of ITM Probe and is followed by discussion and future 
directions in Sections|5]and|6l with more technical details in the Appendix. 



2 Theory 

2.1 Preliminaries 

We will closely follow the notation from our earlier paper dStoimirovic and "^ 120071) . expanding on it where nec- 
essary. We represent an interaction network as a weighted directed graph T = (V, E, w) where V is a finite set of 
vertices of size n, E C V x V is a set of edges and w is a non-negative real-valued function on V x V that is positive 
on E, giving the weight of each edge (the weight of non-existing edge is defined to be 0). Assuming an ordering of 
vertices in V, we represent a real-valued function on V as a state (column) vector (p 6 R™ and the connectivity of T 
by the weight matrix W where W%j = w(i,j) (the weight of an edge from i to j). If T is an unweighted undirected 
graph, W is the adjacency matrix of T where 



2 if i = j and (i, i) 6 E, 




1 



if i ^ j and (i, j) e E, (1) 



if (i,j)#E 



We do not make distinction between a vertex v £ V and its corresponding state given by a particular ordering of 
vertices. Denote by P the n x n matrix such that 



a 



Pi3 = ^T^> (2) 



where ai G (0, 1] for all i. 

When oli = 1 for all i, the matrix P is a transition matrix for a random walk or a Markov chain on T: for any 
pair of vertices i and j, Pij gives the transition probability from vertex i to vertex j in one time step. In the general 
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case, the node-specific damping factors cti model dissipation of information: at each step of the random walk there 
is a probability that the walk leaves the graph. The value aj measures the likelihood for the random walk leaving the 
vertex i to remain in the graph. 



2.2 Emitting and absorbing modes 

We aim to discover the properties of information flow through a given network by examining the paths of discrete 
random walks. A random walker starts at an originating node, chosen according to the application domain, and 
traverses the network, visiting a node at each step. Each walk terminates at an explicit boundary vertex or due to 
dissipation, which is modelled as reaching an implicit (out-of-network) boundary node. 

We distinguish two types of boundary nodes: sources and sinks. Sources emit information, that is, serve as the 
origins of random walks. All information entering a source from inside the network is dissipated, so a walker is not 
allowed to visit the source more than once. Sinks absorb information, serving as destinations of walks; information 
leaving each sink is dissipated. The network graph together with a set of boundary nodes and a vector of damping 
factors a provides the context for the information flow being investigated. 

The main variable of interest is the (averaged) number of times a vertex is visited by a random walk given the 
context. Let D denote the set of selected boundary nodes, let T = V \ D and let m = \T\. Assuming that the first 
n — m states correspond to vertices in D, we write the matrix P in the canonical block form: 

P=\l DD l DT }- (3) 

fTD "TT 

Here Pab denotes a matrix giving probabilities of moving from A to B where A, B stand for either D or T. The 
states (vertices) belonging to the set T are called transient. 



2.2.1 Absorbing mode 

Suppose that the boundary set D consists only of sinks. Let F denote an m x (n — m) matrix such that is the 
total probability that the information originating at i 6 T is absorbed at j G D. The matrix F is found by solving the 
discrete Laplace equation 

(I-P TT )F = P Ti3 , (4) 

where I denotes the identity matrix. The matrix A = I — Ptt is known as the discrete Laplace operator of the matrix 
Ptt- If I — Ptt is invertible, Equation © has a unique solution 

F = GP TD , (5) 

where G = (I - Ptt) -1 - 



2.2.2 Emitting mode 

Now consider the dual problem where D is a set of sources. Let H denote an (n — m) x m matrix such that ify is the 
total expected number of times the transient vertex j is visited by a random walk emitted from source i (for all times). 
Again, H is found by solving the discrete Laplace equation 

H(I-Ptt) =Pdt- (6) 
which, if I — Ptt is invertible, has a unique solution 

H = P_dtG. (7) 



It is easy to show dStoimirovic and Yul 120071) that the matrix G = (I — Ptt) -1 (also known as the Green's 
function) exists if a% < 1 for all i. This condition is not even necessary if the underlying graph T is connected. The 
entry Gy represents the mean number of times the random walk reaches vertex j € T having started in state i £ T. 
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2.2.3 Interpretations 



If we assume that a random walk deposits a fixed amount of information content each time it visits a node, we can 
interpret TJy is the overall amount of information content originating from the source i deposited at the transient 
vertex j. Furthermore, we can interpret Fij as the sum of probabilities (weights) of the paths originating at the vertex 
i G T and terminating at the vertex j G D that avoid all other nodes in the set D, and Hij as the sum of probabilities 
(weights) of the paths originating at the vertex i G D and terminating at the vertex j G T, also avoiding all other 
nodes in the set D. Each such path has a finite but unbounded length. However, unlike Fy, H.^ does not represent a 
probability because the events of the information being located at j at the times t and t' are not mutually exclusive (a 
random walk can be at j at time t and revisit it at time if). For F^, the absorbing events at different times are mutually 
exclusive. 



2.3 Channel mode 



In the absorbing and emitting modes as above, the sources and sinks are kept separate, that i s , the a bsorbing mode 
only supports sinks while the emitting mode only supports sources. In dStoimirovic and Yul [2007) we introduced 
pseudosinks to the emitting mode to overcome this problem. Pseudosinks are transient nodes that dissipate all (or 
most) walks leaving them. An additional practical difficulty in using the above modes is that very few walks reach the 
selected destinations (i.e. sinks or pseudosinks), most being dissipated. To direct the flow towards the destinations, 
we used the potential functions to redistribute the weights of edges of the original graph so that paths moving closer to 
the destinations are favored over those moving away from them. While our model provided very reasonable results on 
many examples from yeast protein-protein interaction networks, it also suffered from several problems. The choice of 
the potentials was purely empirical since there was no theoretical foundation for any particular form they should take. 
Furthermore, the underlying graph is slightly different for each choice of the boundary and destination sets, hindering 
rapid computation at large-scale. 

A way to overcome the above problems is to combine the absorbing and the emitting modes in such a way that the 
absorbing probabilities are used to select the paths from sources to sinks. 

From now, assume V = S U T U K, where the set S denotes the sources, K denotes the sinks and T the transient 
nodes and write the matrix P in the block form as 



(8) 



We will modify (add context to) the underlying graph V so that the random walk can only leave the sources and only 
enter the sinks (no communication is allowed between sources or between sinks). The modified transition matrix, 
denoted P has the form 

P S T P S K 

oo o 



P.SS 


PsT 


P S K 


Pts 


Ptt 


Ptk 


Pks 


Pkt 


P KK 



(9) 



Treating the vertices in S and T as transient for the absorbing mode in |2.2.1| we first derive the matrix F (of size 
\SUT\x\K\): 



F = I- 



" 


P S T ' 


r 


' P S K ' 





Ptt 









P S K 



PstG 
G 



P S K 

Ptk 



+ PstGPtk 
GPta 
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where, as before, G = (I — Ptt) 1 ■ 

Similarly, treating the vertices in T and K as transient for the emitting mode in 12. 2. 21 we derive the matrix H (of 

size 151 x \TUK\): 



H = [ P 



ST rsK 



Ptt Ptk 




PsT PSK 

Pg T G PstGPtk + Psk 



G GP TK 
I 



The entries of F and H are, as before, interpreted as probabilities of absorption at sinks and average numbers of 
visits of walks emitted from sources, respectively. Note that the same Green's function, G = (I — Ptt) -1 , needs to 
be computed for both solutions. Also note that the 'S" rows of F form the transpose of the 'K' columns of H, that is, 
for all s 6 S and k e K, F s k = H s k . 

The matrices F and H can be extended into the matrices F and H, of sizes n x \K\ and |5| x n, respectively (i.e. 
extended over the whole graph) by explicitly setting the S and K portions to the identity matrices: 



F— [ Psk + PstGPtki GPtk, I 
H= I, P ST G, P st GPtk+Psk 



(10) 
(ID 



In this way, we explicitly describe our assumptions that a random walk originating at each source can only leave it and 
not return to any source and that any random walk originating at a sink has probability 1 of staying there. 



2.3.1 Channel tensor 

Define the channel tensor $ € V ® K (g) S* by 



</>: 



H S iF ik . 



(12) 



It can be easily shown that the value of <Pf k gives the expected number of times a random walk emerging from the 
source s and terminating at the sink k visits the vertex i. Furthermore, from Equations dTDT - FTTb . one can easily see that 
for all s G S and k e K, 

d3) 

which supports the above interpretation. The values of ^ depend on the values of the damping factors a.j for all j £ V. 

Let "fs = Ylk'eK k' denote the expected number of times a random walk emerging from the source s reaches 
any of the sinks. Define the normalized channel tensor, 3* G V <S> K <S> S* by 



i.k 



(14) 



The normalized channel tensor ^ k gives the expectation of the number of visits of i by a random walk from s to k, 
conditional on the random walk being terminated at sinks only. Equations ( [T3l[T4"i i imply that for all s E S, 



k£K 



keK 



s,k 



i. 



(15) 



It can be shown that for undirected graphs, with a context consisting of a single source and a single sink, the 
values of 4> are invariant under interchange of sources and sinks. In general, however, reversing sources and sinks 
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gives a different result, both due to asymmetry of the weight matrix in directed graphs and because sources and sinks 
have different roles if more than one of each are present. Random walkers originating from different sources can 
simultaneously visit a transient node while a walk can terminate only at a single sink. Thus, the sinks split the total 
information flow, that is, compete for it, while sources interfere, either constructively or destructively (see |2.3.2l below). 

The damping coefficients influence the normalized channel tensor differently from the non-normalized one or the 
absorbing and emitting solutions. For the non-normalized versions, damping factors control the amount of information 
reaching the boundary and any intermediate points. In the normalized case, we consider only the information having 
reached the sinks .The damping factors determine the paths taken by random walks as they move from sources to 
sinks. Small values strongly favor the nodes on the shortest paths, while large values allow random walks to take 
longer paths and hence favor the vertices with many connections. Appendix[A]contains a more detailed analysis of the 
role of damping factors in the channel mode. 



2.3.2 Summary functions 

The (normalized) channel tensor provides a detailed view of information flow between sources and sinks within a 
context. However, for practical applications, it is sometimes desirable to reduce the amount of available information to 
a single vector over V, which can be tabulated and graphically visualized usin g color maps. The summar y quantities 



that we present below are based on those developed for the emitting model in dStoimirovic and Yul 120071) with some 
modifications. 

The source specific content of a normalized channel tensor denoted is a quantity that for each node i e V 
assigns the total number of visits of a random walk originating from a source s G S, that is, 



keK 



(16) 



The total content of 4>, denoted by r, sums all visits for each node: f, = X^eS fit- r ^ ne conce Pt °f destructive 

interference measures at each node the overlap between visits from different sources. We define the interference 
vector <r over V by 

&t = \S\iwnfc. (17) 

Hence, <jj gives the total number of times the random walks from all sources co-occur at each node (scaled by the 
number of sources). 

With damping factors less than unity, the random walks from sources to sinks effectively visit a small portion of 
the entire underlying network. Information Transduction Module or ITM is a notion that we use to describe the set of 
nodes most influenced by the flow. The nodes are ranked using their values for the total content or interference and the 
most significant nodes are selected. The num ber of selected nodes depen ds on the application-specific considerations 



but we found that the participation ratio tt dStoimirovic and Y u. 2007) of the total content vector r gives a good 



estimate of the number of nodes whose relative amount of content is significant. It is given by the formula 



(£.. 



Ti 



TT(f) = v _ (18) 



3 ITM Probe 

ITM Probe is a software implementation of our theoretical framework, written in the Python programming language and 
available as a standalone library and as a web service. Its main utility is to extract information from protein interaction 
networks by using partial and local information (pairwise interactions), possibly coming from a variety of sources, to 
obtain a context-specific overview of possible information flow patterns. 
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At present, ITM Probe provides three models, channel, absorbing and emitting, corresponding to the modes from 
the previous section. The results are based on the source specific content for the channel and emitting model and on 
absorbing probabilities for the absorbing models. The web service has, due to interface considerations, more limited 
parameter sets: the damping factor is assumed constant for all nodes (henceforth denoted a) and the numbers of 
sources and sinks are limited. 



3.1 Input parameters 

ITM Probe requires as input the model, an interaction graph and the context parameters (sources, sinks and damping 
factors). The web service channel model allows up to three independent contexts to be specified, allowing investigation 
of the overlap of independent channels through interference. 

At present, the web service supports only the yeast physical interaction networks derived from the BIOGRID 
( Star k et al. , 2006) database - we intend to eventually make available the networks from other model organisms. Our 



network graphs consist only of those interactions that are from low-throughput experiments (that is, from publications 
reporting less than 300 interactions) or are reported by at least two independent publications. We provide a version of 
the yeast network where the interactions labelled with 'Biochemical activity' are considered as directed links (bait — » 
prey) as well as a gr aph where all edges a re undirected. 

It is well known (iSteffen et al.[ |2002|) that proteins with a large number of non-specific interaction partners might 
overtake the true signaling proteins in the information flow modeling. Therefore, ITM Probe allows users to specify 
nodes to exclude from the network. For the yeast network the nodes excluded by default include cytoskeleton proteins, 
histones and chaperones, since they may provide undesirable shortcuts. 

The emitting model also allows setting the destinations of information and the potentials that modify the network 
to attract the flow from sources to them. These options are now rendered obsolete by the channel model and serve only 
to illustrate the concepts from our earlier paper. The potentials directing the flow towards sinks can also be set for the 
absorbing model. 



3.2 Output 

The output of the web version of ITM Probe consists of the graphical representation of the resulting ITM, query-related 
summary statistics and a table listing the top ranking members of the ITM (Fig. [TJ. Each row of the table contains 
the protein name, the associated model values and the links to selected databases containing the descriptions of the 
protein. 

The graphical representation shows the sub-network of most significant nodes from the resulting ITM, laid out 
according to a layout seed (every seed produces a different drawing of the same sub-network). The table shows only 
the nodes shown in the graphical display but the full model solution can be downloaded in the CSV format for further 
analysis. 

The nodes comprising an ITM are selected according to the total content (emitting and channel models) or the 
total likelihood of reaching a sink (absorbing model). The number of nodes to be shown is either specified directly or 
determined through a criterion such as participation ratio or the cutoff value. 

Each graphical layout can be rendered and saved in multiple formats (SVG, PNG, JPEG, EPS and PDF). Images of 
large sub-networks in SVG format can be navigated online using the 'Network Navigator' applet. For each rendering, 
the users can choose which aspects of results to display, the color map and the scale for presentation (linear or logarith- 
mic). When multiple boundary points are specified, it is possible to obtain an overview of all flows simultaneously by 
selecting the color mixture scheme (Fig.[T|). In this case, each source (in the channel or emitting model) or sink (in the 
absorbing model) is assigned a basic CMY (cyan, magenta or yellow) color and the coloring of each displayed node 
is a result of mixing the colors corresponding to its source- or sink- specific values for each of the boundary points. 



6 



Graph: BioGRID-2 43 Yeast Directed (3680 nodes. 43934 links) 

Contest fl: [STE2 . CDC42] -> [STE12] (Dissipation^ -01) 

Excluded nodes: 165 



GRAPHIC SUMMARY 




SUMMARY 



Qoantny V^ue 

Vists Participation Ratio 239.1 7 

Total Nodes Visaed 49.98 

' :. 0.21 

Total Interference 33.75 

Avg path length trom STE2(1) 24.7307 

Avg path length trom CDC42(1) 23.2211 



TOP SCOPJUG IIOUES (PER SOURCE! 

Rank Mode STE2(1) CDC42(1) Interference Total tints 

1 STE12 1 4 2- 2 CYGD NCBI SGD 

2 STE2 10 1 CYGD NCH SGD 

3 CDC42 1 1 CYODNCBISGD 



Figure 1: A partial screenshot of the results of the ITM Probe web service. 



4 Examples: Yeast Pheromone Pathway 

For our examples, we use the mating pheromon e response path way in Saccharomyces cerevisiae, the one of the best 



understood signalling pathways in eukaryotes dBardw ell. 2005). The mating signal is transferred from a membrane 



receptor to a transcription factor in nucleus, leading to transcription of mating genes. We will only provide a very brief 



overvi ew of the pathway necessary for discussing our examples; more details are available in the review by Bardwell 



(2005) 



A mating pheromone binds the transmembrane G-protein coupled pheromone receptors Ste2p/Ste3p. This leads 
to dissociation of Ste4p and Stel8p, the membrane bound subunits of the G-protein complex, which also contains the 
subunit Gpalp. Ste4p then binds to the protein kinase Ste20p, which is recruited to the membrane through Cdc42p, 
and the scaffold protein Ste5p. On the scaffold, a MAPK (mitogen activated protein kinase) cascade occurs, where 
each protein kinase in a cascade is activated by being phosphorylated by the previous kinase and in turn activates the 
next protein. In this case, the cascade goes Ste20p — > Stel lp — > Ste7p — > Fus3p or Ksslp. The final activated kinase 
Fus3p or Ksslp then migrates to the nucleus where it phophorylates the proteins Diglp and Dig2p, the repressors of the 
Stel2p transcription factor activity. The Stel2p transcription factor can then, in coordination with other transcription 
factors such as Teclp, promote the transcription of the mating genes. 

As the underlying network, we use the BIOGRID-2. 0.43 interactions, filtered in favor of low-throughput inter- 
actions as described in 13.11 The physical binding interactions are given a weight 1.0 in both directions while the 
biochemical activity interactions are interpreted as directional and given a weight of 2.0 (where both physical and bio- 
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Figure 2: Yeast pheromone response ITMs: in each case we report the context described as {sources}^{sinks} (a). 
The number of nodes shown is determined by the participation ratio in all cases except (d), where only 40 top ranking 
nodes are shown due to space constraints. The parts (a), (e) and (f) are colored by the total content, (b) and (d) by 
source specific content (color mixture) and (c) by interference. Contexts: (a) {Ste20p}^{Stel2p} (0.85); (b) {Ste2p, 
Cdc42p}^{Stel2p} (0.85); (c) {Ste2p, Cdc42p}->{Stel2p} (0.85); (d) {Ste2p, Cdc42p}- >{Stel2p, Gal4p, Ino4p, 
Ume6p, Yaplp, Raplp} (0.85); (e) {Ste2p,Cdc42p}->{Stel2p} (0.55); (f) {Stel2p}->{Ste2p,Cdc42p} (0.85). 
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chemical interactions are reported, only biochemical are considered). We used a set of 165 excluded nodes, consisting 
of histones, chaperones and cytoskeletal proteins. We found that the excluded nodes do not strongly affect the results 
of the particular examples presented here. 

Fig.|2a) focuses solely on the MAPK cascade portion of the pheromone pathway, with Ste20p as a single source 
and Stel2p as a single sink. Selection of top proteins by participation ratio captures all important participants of the 
cascade plus the proteins that do not play a direct role but interact with the most relevant ones (Slt2p, Cln2p). To 
obtain an ITM best describing the entire pheromone response pathway, it is necessary to include two sources, the 
receptor Ste2p and the membrane-bound protein Cdc42p (Fig. |2jb,c)). Including only Ste2p is not sufficient because 
of the shortcut through the link Gpalp — > Fus3p, which avoids the MAPK cascade. Furthermore, inclusion of Cdc42p 
is biologically sensible because Cdc42p activates Ste20p and is hence necessary for the MAPK cascade. Interference 
between the information flows from Ste2p and Cdc42p (Fig.|2jc)), rather than total visits, highlights the most important 
proteins in the pathway. 

The channel model is relatively robust to addition of non-relevant sinks to its contexts. In Fig. |2jd), we added 
five more transcription factor proteins as sinks, in addition to Stel2p. As can be seen, the most visited nodes mostly 
belong to the channel to Stel2p while the remaining sinks are linked to sources by weaker channels (mostly not shown 
because the figure shows only the top 40 nodes). In this case, Stel2p has 0.77 total visits (out of 2) with interference 
of 0.74. The remaining 1.23 visits are distributed among the other five sinks. 

The importance of the choice of damping factors has been emphasised in the theoretical sections and in Appendix 
lAl For example, setting a = 0.99 with Cdc42p and Ste2p as sources and Stel2p as a sink, as shown in the screenshot 
from Fig. Q] leads to participation ratio of 239 and average path lengths of about 24 (the values are similar for both 
sources), that is, a random walk emanating from one of the sources visits on average 23 proteins before reaching 
Stel2p (the values for a = 0.85 are 27 for the participation ratio and about 7 for both average path lengths). On the 
other hand, small values of a (Fig. |2fe)) lead to only the shortest paths being significant. The ITM Probe default value 
of a = 0.85 gives, in our experience, the best results in emphasising biologically relevant channels. 

Fig-Eff) shows the effects of reversing sources and sinks. The resulting ITM performs much worse in describing 
the pheromone pathway for both reasons mentioned in 12. 3. II Firstly, the pheromone response pathway is dominated 
by the MAPK phosphorylation cascade, which is in our case modelled by directed links. Secondly, since the sinks 
are competing, most of the information emitted from Stel2p is captured by Cdc42p, leaving little for Ste2p. This 
illustrates different semantic roles for sources and sinks in the channel mode that need to be taken into account when 
using ITM Probe. 

5 Discussion 

Applied to protein interaction networks, ITM Probe can be used for exploration of network graphs and as a hypothesis 
forming tool. The absorbing and emitting models can be used to glimpse the network neighbourhoods of selected 
nodes and hence provide a good view of protein complexes. The ITMs obtained by absorbing and emitting models 
centered at the same nodes are slightly different: the absorbing model can reveal relatively distant 'leaf nodes linked 
to a sink by a nearly unique path while the emitting model favors highly connected clusters due to the 'resonance' 
effect they induce: a random walk enters a cluster and visits many of its nodes several times before leaving it. 

The channel model can be employed for discovery of potential pathways linking certain proteins or biological 
functions associated with them. Multiple sources (perhaps in multiple contexts) together with interference can be used 
to investigate potential points of crosstalk between information channels, while a solution based on a carefully chosen 
set of competing sinks can propose a plausible biological explanation among several possible ones. 

We emphasize that when using ITM Probe as a discovery tool, it is up to the user to select the model context 
according to the biological context investigated and to verify the results afterwards. The choice of interaction graphs 
and excluded nodes is critical because of sensitivity of random walk frameworks to shortcuts and because interaction 
graphs by necessity do not capture the entire biochemical environment within the cell. Therefore, it is desirable to 
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select high-quality data relevant to the biological process under investigation. For example, a particular link, even if 
thoroughly experimentally verified, may have to be removed from the network if it corresponds to an interaction that 
does not occur in conjunction with other interactions relevant for the selected biological process. 

Us e of damping fa ctors gives an advantage to our frame work over sim i lar m odels proposed based on random 
walks dTu et all l2006f) . or equivalently, electrical networks dSuthram et all 120081) . Dissipation, controlled by the 
user, focuses random walks to the portion of the network relevant to the context, preventing visits to distant nodes. 
Computationally, obtaining a solution involves simp ly a solut i on to a (sparse) system of linear equations, without 
necessity for any simulations of random walks as by ITu et al.\ d2006l) . For large scale computations using the same 
graph and damping factors but different boundaries, it is possible to reuse a previously computed solution (Appendix 
IB1 for computation of a new solution. This technique can lead to a significant speed-up for large networks. 



6 Future directions 

The mathematical framework presented here is flexible and can be applied and extended in a number of ways. We 
intend to apply it to physical interaction networks from more species, such as Homo sapiens and, beyond that, to 
metabolic or genetic networks, or a combination of networks of different types. 

In another direction, we plan to extend the framework so that it is able to take into account additional biological 
information, if available. Such data includes protein complex memberships, activation states and their triggers as well 
as protein abundances. While these extensions may lead to nonlinear models, our aim is to preserve the simplicity of 
the present framework while improving accuracy. We see ITM Probe primarily as a discovery and hypothesis forming 
tool, rather than a detailed model of the cell. 
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Appendix 

A The role of the damping factor in the channel mode 

Let W be a weight matrix for a weighted connected graph T = (V, E, w) and let a be a vector of positive coefficients 
over V such that a>i < 1 for all i E V. Let /i = max^ a, and let = onj pk. Then any transition matrix P for T with 
dissipation given by a. is given by P = /iQ, where 

n..- 2 y ( 19) 



for alH, j G V by and < <ij < 1. As in the main text, let S, K and T be sets of sources, sinks and transient nodes, 
respectively, such that V = S U K U T. 

From here onwards, we will consider \i as a free parameter in (0, 1) and the transition matrix P as dependent on 
/i. For this range of /j,, the solution matrices F and H from Equations ( fToWTTl) are well defined, that is, the Green's 
function G = (I — Ptt) 1 = S^Lo ^tt i s well-defined. As fj, J, 0, all the damping factors in a uniformly tend to 
and P — > 0; however, as we will show, the normalized channel tensor is well-defined in the limit as /i — > (provided, 
of course, it is well defined for other values of n). Note that, by construction, HQH^ = 1 and hence the solutions may 
not exist as ^ | 1 and P — > Q. 

A.l Path lengths 

The damping parameter [i controls the distribution of lengths of the paths (or the times) a random walk emitted from 
a source takes before being absorbed at a sink. 

For any s 6 S, let L s denote the random variable giving the length of the path (or a number of steps) taken by a 
random walk originating at s and terminating at any sink. The underlying probability density P(L S = n) is given by 

P(n) i/£ fc ^^< for " = 1 ; (20) 

{E k <eK[PsTP n TT 2 PT K } sk , forn>2. 

We now compute closed forms for the mean and variance of L s and discuss their dependence on fi. We require the 
following identities stated without proof: 

oo 

^(n + 2)PJ T = G 2 + G, (21) 

n=0 
oo 

(n + 2) 2 P £ T = 2G 3 + G 2 + G, (22) 



n=0 



^M 3 Gy=M[G 2 + G] y (23) 
^[G% = 2„[G»] tf . (24) 
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The first moment (the mean) of the distribution of L s , E(L S ), is given by 

E ^ = W E ( P » k ' + f> + 2 ) [PstP^tPtkU,] 

S k'EK \ n=0 ) 

= ^E (P sk ' + [PsT(G + G^P TK ] sk ,) 

s k'EK 



(25) 



ff Fi 



i+EE 



kEKiET s 



while the second moment E(L^) is 

E(L 2 ) = 



jT E (^+E(« + 2) 2 [P5tP? t Ptk]^) 

s k'eK 

J- £ [P S TG 3 P T K] sfe ,+E(L s ). (28) 



k'EK 



The variance of i s can therefore be computed using the standard formula Var(i s ) = E(L 2 ) — E (L s )- The mean 
and the variance are related through /z: 



Lemma 1. Let s e 5. Then, for all [i G (0, 1), 



Var(L s )=/x^-E(L s ). (29) 



Proo/ LetX sfe = [P ST G 2 P T K] sfc . Then, 

9/i 



± E(L S ) = £ % a *" ■ (30) 



Using Equation d24l > we obtain 



= - [P ST G 3 P T if] sfc , (31) 
A* 
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and, by Equations (l23l and (|25T >. 



dfi 



= E ( + E Q^-Q^G^Q 



k'eK 



i,jeT 



- E (^' + [P5T(G + G 2 )P TA -] sfc ,) 

" k'eK 



— E(L S 



(32) 



Therefore, by Equations ( f30b and d28l ). 
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= W E [P«"G»P TJ c] >fc , -E(L.) E ^ 

= E(L 2 S ) - E(L S ) - E(L S )(E(L S ) - 1) 
= Var(i s ). n 



□ 



A direct consequence of Lemma[T]is that E(L S ) is a continuous, increasing function of p 6 (0, 1). It is bounded 
from below: as fi J, 0, the variance of L s vanishes, and, as will be shown in the remainder of this section, the average 
path-length converges to the length of the shortest path from the source to any of the sinks. The normalized channel 
tensor may be undefined for fi = 1 (the Green's function G need not exist) and in this case E(L S ) is unbounded as 
p, | 1. If G is well-defined for p = 1, then E(L S ) is bounded and attains its maximum at /x = 1. The value of the 
maximum depends on the underlying network graph and the particular context. 



A.2 Large dissipation asymptotics 

For all i, j 6 V, let p(i,j) denote the (unweighted) length of the shortest directed path between i and j. We allow 
p(i,j) = oo if there exists no directed path between i and j. It is well-known that p is a (not necessarily symmetric) 
distance that satisfies the triangle inequality, that is, for all i, j, k S V, 



P(i,j) + P(j,k) > p(i,k). 



(33) 



For any source s e S, let K s denote the set of all points k" £ K such that for all j 6 K s and every k <E K \K S , 
p(s,j) < p(s, k). The set K s contains all the sinks that are closest to s. 

Theorem 2. Let s 6 S, i G T and k 6 K such that p(s, i) and p(i, k) are both finite. Then, if k £ K s and i lies on 
the shortest path from s to k, 



lim 

/•40 



14st*4tt 


si 


Htt Htk 


ik 


J2k'eK s 


QstQtt' 1 * 2 Qtk 


sk' 



(34) 



Otherwise, lim^o d>| fc 



0. 



Proof. Let s e S, i £ T and k £ K. Since, p(s, i) and k) are finite, it follows that p(s, k) is also finite, that is, k 
is reachable from s through i and the normalized channel tensor # is well defined for all p £ (0, 1). Recall that 



*f,fc = 



LA' 



[P ST G] s ,[GP™] lfc 



(35) 
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where F sk , = P SK + Ps T GP T if] s fc'- 

Let u, v £ T and let d = p(u, v). It can be easily shown (see Lemma A. 3 from dStoimirovic and Yul 12007 ) for a 
partial proof) that [P xx ] w = for all n < d and [P TT ] . . > 0. Therefore, 

oo oc 
Guv = E i P TT] uv = E ^ [Q?T] M , = [Qt T ] uv + 0(^ d+1 ) 



as p I 0. Hence, 



3'er 



TT 



0(p 



p(j,i)+2 



pf (s ' l) [QstQ 



p(s,i)— 1 
TT 



\GP 



TK ik 



Em 



p(i,i)+i 



QP{iJ) 
TT 



ijk 



0{p p{s ^ +1 ), 



0( m 



p(i,fc)+l> 



(36) 



(37) 



Let £ = p(s, fc"), where fc" £ A",,.. We will consider the denominator of Equation (f35T > under two separate cases, £ = 1 
and£ > 1. 

If £ > 1, for all k' £ A", the vertices s and fe' are not adjacent and thus P s k> = 0. Hence, since s and fc' are 
connected, there exist j, j' £ T such that p(s, fe') = p(s, j) + p(j, j') + p(f, k') = p(j, j') + 2, implying 



[PstGPtkU' 

w'eT 



QpU-j') 
TT 



if 



Q fk> + 0(//("')+ 3 ) 



Therefore, 



'> [QstQtt A: '' > ~ 2 Qtk| + 0(// (s ' fe ' )+1 ) 

L J sfe' 



E Psk + PsTGPrif] s fc' 

fc'G-K" 

V ^ \QstQ T t 2 Qtk] + 0(^ +1 ) 



k'£K a 



and, as /i | 0, 





*4st^Jtt 


S2 


r .p(i,fc)-l r . 






QstQ tt 2 Qtk 





By the triangle inequality and our assumptions on s, i and k, 

p(s,i) + p(i,k) > p{s,k) > 



(38) 

(39) 
(40) 

(41) 
(42) 



The first inequality becomes an equality if and only if i lies on the shortest path between s and k while the second is 
an equality if and only if k £ K s . Therefore, if the assumption of the theorem is satisfied, the value of <Pj k converges 

to the value of the right hand side of Equation ( f34b . while otherwise lim^o k = 0. 

On the other hand, if £ = 1, ^ s — > J2k'eK ^Qsk' + 0(p 2 ) and therefore, since p(s, i) + p(i, fc) > 2, ^| fc — > 
as fi i 0. □ □ 
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We have therefore shown that, as p, | 0, only the nodes associated with the shortest path from each source to the 
sink(s) closest to it will have positive values of the normalized channel tensor - all other entries will be exactly 0. 



Theorem 3. Let s e S and suppose the normalized channel tensor <& is well defined for all p G (0, 1). Then, 



limE(L s ) = p(s, k), 



(43) 



where k £ K s 



Proof. Let s <E S, let k <G K s and let d = p(s, k). For m = 1,2 . . .d — 1, let U s (m) = {i e T : p(s, i) = 
m and p(s, i) + p(i, k) = d}. The set IT s (m) consists of all transient nodes that are at the distance m from s on a 
shortest path from s to any of the sinks closest to s. By Theorem|2] 

k"£KieT 

= E E E 

k"<EK s m=l i6n s (rT 



[QstQ™t ] s i 




1 Qtk 


ik" 




Q5T 


n p(s,fe)- 


2 Qtk 


s k ' 



d-1 



E EE 

k"£K s m=l iGT 



[QstQ 



m— 11 
TT J 



[Qtt™ 1 Qtk] 



E 



^1 Efc'GK s [QsTQTT 2 Q^] sfc , 



Therefore, by Equation d27l i. 

as required. 

A.3 Applications 



HmE(L a ) = 1 + lim ]T 5^*?,fc = 

M k'GKieT 
□ 



□ 



A possible application of the results of this section is to determine the damping factors for a context through specifying 
the average path length of random walks. For example, if a single source s is present, we can write E(X S ) = f(p) 
(where / is given by Equation (l27l i) and solve for any valid value of path length. Since / is continuous and increasing 
on (0, 1), the required (possibly non-unique) value of p, can be easily found numerically. 

If there are multiple sources within the context, one can specify the sum or a weighted average of path lengths of 
walks emitted from all of the sources and then solve for p. Even finer application specific control may be possible by 
constraining damping at specific nodes as well as the general damping factor p. 



B Rapid Evaluation of Submatrix Inverses 



Consider an invertible block matrix M = 



A B 
C D 



where A is a square matrix. It is a well known result of linear 



algebra (see for example Poole (2005), chapter 3) that the inverse of M can be written as 

M" 1 



A 1 + A iBQ iCA 1 
Q *CA 1 



A _1 BQ 1 

Q 1 



(44) 
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where Q = D — CA _1 B. Suppose we are interested in computing matrices of the form A _1 U, where A is very 
large and U is an arbitrary matrix with appropriate number of rows. If it is necessary to perform a large number 
of such computations with different square submatrices A (the matrix M may be permuted in each case to reorder 
the indices), it could be effective to precompute the matrix M _1 (or, computationally more appropriately, its LU- 
decomposition) once and in each case extract the required inverse A -1 through simple and relatively inexpensive 
algebraic manipulations and permutations. 

X Y 

, with the blocks of the same size as in Equation (PPfl) . and observe that W = 



Indeed, write M 
Q 1 and hence YW 



u Z 

*Z = A 



W 



CA . Therefore, 



A" 1 =X-YW" 1 Z, 



(45) 



and thus 

A" 1 U = XU-YW" 1 ZU. (46) 

Since, by our assumption, the matrix W is very small compared to A, computing the inverse of W and applying 
Equation ( l43T > over many evaluations of A~ 1 can be significantly faster than extracting the submatrix A and computing 
its inverse directly each time. The cost of initial computation of M _1 can be amortized over many evaluations of the 
inverse of the submatrices. 

To evaluate of the matrix A _1 U, we compute the matrices YW" 1 and 



ivr 1 



" u " 




X 


Y 




U " 




xu 







z 


W 









zu 



and then produce A 1 U using Equation d46*b . 
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